UBMI-IFC Podcast Generator

An Automated Biomedical Research Podcast Generation System

Author

UBMI-IFC Team

Published

November 19, 2025

Modified

November 19, 2025

Scraper

Code
import json
import pandas as pd
import itables
from IPython.display import HTML, display

# Enable interactive mode for all DataFrames in the notebook
itables.init_notebook_mode(all_interactive=True)

# Silence itables typing warnings about undocumented 'options' argument
itables.options.warn_on_undocumented_option = False

# Load the JSON of scraped publications
try:
    with open('./data/raw/all_ifc_publications.json', 'r', encoding='utf-8') as file:
        publications_scraper = json.load(file)
    print(f"Resultado de scraper: {len(publications_scraper)} publicaciones")
except FileNotFoundError:
    print("Error: './data/raw/all_ifc_publications.json' no existe.")
    publications_scraper = []
except json.JSONDecodeError:
    print("Error: formato inválido de JSON.")
    publications_scraper = []

# If there is data, convert it to a DataFrame and display it with itables
if publications_scraper:
    # Flatten nested JSON structures into columns
    df = pd.json_normalize(publications_scraper)

    # Display rows in random order
    df = df.sample(frac=1).reset_index(drop=True)

    # Inject CSS to truncate long text with ellipsis in the rendered DataTable
    css = """
    <style>
    /* Target DataTables cells rendered by itables */
    table.dataTable td, table.dataTable th {
      max-width: 180px;
      white-space: nowrap;
      overflow: hidden;
      text-overflow: ellipsis;
    }
    /* Allow horizontal scrolling for very wide tables */
    div.dataTables_wrapper {
      overflow-x: auto;
    }
    </style>
    """
    display(HTML(css))

    # DataTables options: fix column widths and enable horizontal scroll
    dt_options = {
        'autoWidth': False,
        'columnDefs': [{'targets': '_all', 'width': '180px'}],
        'scrollX': True,
        'pageLength': 25
    }

    itables.show(df, classes='stripe hover order-column', options=dt_options)
Resultado de scraper: 404 publicaciones

Expand the database

We want to expand the database by searching PubMed for publications affiliated with “IFC, UNAM”. To collect the name variations of the institute, we first look at how it is written in the articles themselves, so that the search can be improved.

Obtaining PDFs

For this, we can use one of the 3 variations reported in ‘notebooks/02_expand_database.ipynb’. Option B was used: export the publications to BibTeX, import them into Zotero, and use an extension that downloads the PDFs.
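As a rough sketch of option B's first step, the scraped JSON records can be serialized to minimal BibTeX entries before importing into Zotero. The field names (`author`, `title`, `journal`, `year`, `doi`, `url`) mirror the entry shown below, but are assumptions about the scraper's schema:

```python
def to_bibtex_entry(pub, key):
    """Serialize one publication dict to a minimal BibTeX @article entry."""
    fields = []
    for field in ('author', 'title', 'journal', 'year', 'doi', 'url'):
        value = pub.get(field)
        if value:  # skip missing fields
            fields.append(f' {field} = {{{value}}}')
    return '@article{' + key + ',\n' + ',\n'.join(fields) + '\n}\n'

# Hypothetical record for illustration
pub = {'title': 'Example paper', 'year': 2021, 'doi': '10.1000/xyz'}
entry = to_bibtex_entry(pub, 'Example2021_ifc_1')
print(entry)
```

Writing one such entry per record to a `.bib` file gives Zotero everything it needs for the import.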

Code
# Let's also check the actual BibTeX content
publicaciones_bibtex = './data/processed/all_ifc_publications.bib'

print("\n📄 Ejemplo de artículos en BibTeX:")
with open(publicaciones_bibtex, 'r', encoding='utf-8') as f:
    content = f.read()
    # Show the first entry; fall back to a short preview if the closing brace is not found
    first_entry_end = content.find('\n}\n')
    print(content[:first_entry_end + 3] if first_entry_end != -1 else content[:500])

📄 Ejemplo de artículos en BibTeX:
@article{Abbruzzini2021_ifc_235,
 abstract = {ABSTRACTPurpose: The production of Technosols is a sustainable strategy to reuse urban wastes and to regenerate degraded sites. However, little is known regarding the role of the activity of enzymes associated with carbon and nutrients cycling on organic degradation and microbial activity in these soils. Methods: A controlled experiment was conducted with Technosols made from construction wastes, wood chips, and compost or compost plus biochar, in order to evaluate their organic matter (OM) degradation potential and functioning through the activity of enzymes and microbial community composition. Results: The Technosols had organic carbon contents from 13 to 30 g kg−1, carbon-to-nitrogen ratio from 10 to 20, and available phosphorus from 92 to 376 mg kg−1. The Technosols with biochar and compost had alkaline pH and higher contents of organic carbon and available phosphorus compared to Technosols with compost alone. The mixture of wood chips and compost presented the highest enzyme activities, and might be the most appropriate for Technosol’s production. The mixture of concrete and excavation waste with compost and compost plus biochar displayed a potential for OM decomposition comparable to that of wood chips with compost plus biochar. These results suggest that the bacterial and archaeal fingerprint is similar among the Technosols, although differences are observed in the relative abundances of their taxa. Conclusions: Substrate composition affects the processes of OM transformation, microbial biomass activity, and composition. The mixture of wood chips and compost presented the highest enzyme activities during the incubation period, and might be the most appropriate for its application as a Technosol. The mixture of concrete and excavation waste with either compost or compost plus biochar displayed a potential for organic matter decomposition that was comparable to that of the mixture of wood chips with compost plus biochar. 
The microbial communities in these Technosols are not significantly different yet, but the bioavailability of nutrients derived from the changes in the soil matrix (by adding construction waste and biochar) is influencing soil enzymatic activity.},
 author = {Abbruzzini and T. F. and Reyes-Ortigoza and A. L. and Alcántara-Hernández and R. J. and Mora, L. and Flores, L. and & Prado, B.},
 doi = {10.1007/s11368-021-03062-2},
 journal = {Journal of Soils and Sediments},
 note = {Instituto de Fisiología Celular, UNAM},
 title = {Chemical, biochemical, and microbiological properties of Technosols produced from urban inorganic and organic wastes},
 url = {https://www.ifc.unam.mx/publicacion.php?scopus=85114778157},
 year = {2021}
}

PDFs obtained:

Code
from pathlib import Path

dir_path = Path('papers/downloaded/zotero')
if not dir_path.exists():
    print(0)
else:
    count = sum(1 for p in dir_path.rglob('*') if p.is_file() and p.suffix.lower() == '.pdf')
    print(count)
345

Affiliation mining

Now we extract the affiliations from the text of the PDFs, using ‘PyMuPDF’ together with ‘spaCy’.

  • Uses regex and NLP to detect the affiliations
  • We consider English and Spanish (‘en_core_web_sm’ and ‘es_core_news_sm’)
  • Based on the affiliations found, we will expand the PubMed search
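A minimal sketch of the regex half of this step (the actual pipeline also runs spaCy NER over the PDF text); the pattern and the sample text here are illustrative only:

```python
import re

# Lines that look like an institute/department affiliation, in English or Spanish
AFFIL_RE = re.compile(
    r'(?:Instituto|Institute|Departamento|Department|Universidad|University)'
    r'[^\n;]{0,120}',
    flags=re.IGNORECASE,
)

# Toy snippet standing in for text extracted from a PDF with PyMuPDF
sample = (
    "1 Departamento de Bioquímica, Instituto de Fisiología Celular, "
    "Universidad Nacional Autónoma de México, Mexico City\n"
    "2 Institute of Cellular Physiology, UNAM"
)

matches = [m.group(0).strip() for m in AFFIL_RE.finditer(sample)]
for m in matches:
    print(m)
```

The NER pass then helps discard regex hits that are not actually organization names.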

Results:

Code
# Load the pre-filtered affiliation results
import json

def load_filtered_affiliations(min_score=15.0):
    """
    Load pre-filtered affiliation clusters for PubMed searches.
    
    Args:
        min_score: Minimum relevance score to include (default: 15.0 for high quality)
        
    Returns:
        List of affiliation terms optimized for PubMed searches
    """
    
    filtered_file = './data/processed/filtered_affiliations.json'
    
    print(f"📁 Cargando afiliaciones: {filtered_file}")
    
    with open(filtered_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    clusters = data['relevant_affiliation_clusters']
    
    # Filter by score and extract search terms
    affiliation_terms = []
    
    for cluster in clusters:
        if cluster['relevance_score'] >= min_score:
            
            # Add representative term
            representative = clean_affiliation_for_search(cluster['representative'])
            if representative:
                affiliation_terms.append(f'"{representative}"[Affiliation]')
            
            # Add top variations (limit to avoid too many terms)
            for variation in cluster['variations'][:3]:  # Top 3 variations per cluster
                cleaned = clean_affiliation_for_search(variation)
                if cleaned and len(cleaned) > 10:  # Only substantial terms
                    search_term = f'"{cleaned}"[Affiliation]'
                    if search_term not in affiliation_terms:  # Avoid duplicates
                        affiliation_terms.append(search_term)
    
    print(f"✅ Cargando {len(affiliation_terms)} términos de búsqueda de {len([c for c in clusters if c['relevance_score'] >= min_score])} clusters")
    
    return affiliation_terms

def clean_affiliation_for_search(term):
    """Clean an affiliation term for PubMed search."""
    import re
    
    if not term:
        return ""
    
    # Remove common noise patterns
    cleaned = re.sub(r'[•\d]+\s*', '', term)  # Remove bullets and leading numbers
    cleaned = re.sub(r'[^\w\s\-,.]', ' ', cleaned)  # Remove special chars except basic punctuation
    cleaned = re.sub(r'\s+', ' ', cleaned)  # Normalize whitespace
    cleaned = cleaned.strip()
    
    # Remove very generic prefixes
    prefixes_to_remove = ['the ', 'a ', 'an ', 'at the ', 'from the ']
    for prefix in prefixes_to_remove:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
    
    # Skip very short or generic terms
    if len(cleaned) < 8:
        return ""
    
    generic_terms = ['university', 'institute', 'department', 'school', 'college']
    if cleaned.lower() in generic_terms:
        return ""
    
    return cleaned

# Load the filtered affiliations
print("🔍 Afiliaciones para búsqueda de PUBMED")
print("=" * 60)

filtered_affiliations = load_filtered_affiliations(min_score=15.0)

print(f"\n📋 TOP 10 términos:")
for i, term in enumerate(filtered_affiliations[:10], 1):
    print(f"{i:2d}. {term}")
🔍 Afiliaciones para búsqueda de PUBMED
============================================================
📁 Cargando afiliaciones: ./data/processed/filtered_affiliations.json
✅ Cargando 95 términos de búsqueda de 50 clusters

📋 TOP 10 términos:
 1. "Instituto de Fisiología Celular"[Affiliation]
 2. "Molecular Genetics, Instituto de Fisiología Celular"[Affiliation]
 3. "Universidad Nacional Aut onoma de México"[Affiliation]
 4. "Universidad Nacional Autónoma de México,"[Affiliation]
 5. "Universidad Nacional Aut onoma de M exico"[Affiliation]
 6. "Institute for Cellular Physiology"[Affiliation]
 7. "Institute of Cellular Physiology"[Affiliation]
 8. "Institute of Cellular Physiology at UNAM"[Affiliation]
 9. "Department of Biochemistry and Structural Biology, Institute of Cellular Physiology"[Affiliation]
10. "Department of Biochemistry and Structural Biology,"[Affiliation]

After manually reviewing and cleaning the search terms found, we run the PubMed search and expand the database:
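The expansion query itself can be sketched by OR-joining the cleaned `[Affiliation]` terms into a single PubMed search string; the terms and the date filter below are illustrative:

```python
def build_pubmed_query(affiliation_terms, extra_filter=None):
    """Join [Affiliation] terms with OR into one PubMed query string."""
    query = ' OR '.join(affiliation_terms)
    if extra_filter:
        query = f'({query}) AND {extra_filter}'
    return query

# Hypothetical terms; the real list comes from load_filtered_affiliations()
terms = [
    '"Instituto de Fisiología Celular"[Affiliation]',
    '"Institute of Cellular Physiology"[Affiliation]',
]
query = build_pubmed_query(terms, extra_filter='2000:2025[dp]')
print(query)
```

The resulting string can then be submitted to NCBI E-utilities, e.g. with Biopython's `Entrez.esearch`.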

Code
# Compare counts between two JSON files and show a table with the raw JSON contents for exploration

from pathlib import Path
import json
import pandas as pd
import itables
from IPython.display import HTML, display

itables.init_notebook_mode(all_interactive=True)
itables.options.warn_on_undocumented_option = False

def safe_load_json(path):
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return json.load(f)
    except FileNotFoundError:
        print(f"⚠️ Archivo no encontrado: {path}")
        return None
    except json.JSONDecodeError:
        print(f"⚠️ JSON inválido: {path}")
        return None

def count_records(obj):
    """Heurística simple para contar entradas en un JSON cargado."""
    if obj is None:
        return 0
    if isinstance(obj, list):
        return len(obj)
    if isinstance(obj, dict):
        # If dict contains an obvious list of records, pick the largest list
        list_lengths = [len(v) for v in obj.values() if isinstance(v, list)]
        if list_lengths:
            return max(list_lengths)
        # fallback: count top-level keys as 1 record (or 0)
        return 1 if obj else 0
    return 0

# Paths to compare (if the "processed" file does not exist, we'll compare the same raw file)
path_a = Path('./data/raw/all_ifc_publications.json')
path_b = Path('./data/processed/pubmed_filtered_search_results.json')

if not path_b.exists():
    # Fall back to the same file if processed version is not present
    path_b = path_a

json_a = safe_load_json(path_a) or []
json_b = safe_load_json(path_b) or []

count_a = count_records(json_a)
count_b = count_records(json_b)
diff = count_b - count_a

# Print a short comparison summary
print(f"Archivo A: {path_a} → {count_a} entradas")
print(f"Archivo B: {path_b} → {count_b} entradas")
if path_a.exists() and path_b.exists() and path_a.samefile(path_b):
    print("Nota: ambos paths apuntan al mismo archivo.")
print(f"Diferencia (B - A): {diff}")

# Show a small summary table with pandas + itables
summary_df = pd.DataFrame({
    'archivo': [str(path_a), str(path_b)],
    'entradas': [count_a, count_b]
})
display(HTML("<h4>Resumen de conteo</h4>"))
itables.show(summary_df, classes='stripe hover order-column', options={'paging': False, 'searching': False})

# Finally, show the contents of JSON B as a DataFrame for exploration
display(HTML("<h4>Exploración: contenido de ./data/processed/pubmed_filtered_search_results.json</h4>"))

def extract_record_list(obj):
    """Return a list of records from common JSON shapes."""
    if obj is None:
        return []
    if isinstance(obj, list):
        return obj
    if isinstance(obj, dict):
        # Pick the largest list value if any (e.g. 'articles', 'items', 'results')
        list_values = [v for v in obj.values() if isinstance(v, list)]
        if list_values:
            return max(list_values, key=len)
        # Otherwise treat the dict itself as a single record
        return [obj]
    # Fallback: wrap scalar into list
    return [obj]

# Extract the record list from JSON B and flatten it
records = extract_record_list(json_b)

if not records:
    print("No records found in JSON to normalize.")
else:
    try:
        df_raw = pd.json_normalize(records)
    except Exception:
        # Last-resort: wrap the whole object
        df_raw = pd.DataFrame(records)

    # Keep display compact: randomize order and limit columns width (CSS)
    df_raw = df_raw.sample(frac=1).reset_index(drop=True)
    css = """
    <style>
    table.dataTable td, table.dataTable th { max-width: 180px; white-space: nowrap; overflow: hidden; text-overflow: ellipsis; }
    div.dataTables_wrapper { overflow-x: auto; }
    </style>
    """
    display(HTML(css))
    itables.show(df_raw, classes='stripe hover order-column',
                 options={'autoWidth': False,
                          'columnDefs': [{'targets': '_all', 'width': '180px'}],
                          'scrollX': True, 'pageLength': 25})
Archivo A: data/raw/all_ifc_publications.json → 404 entradas
Archivo B: data/processed/pubmed_filtered_search_results.json → 852 entradas
Diferencia (B - A): 448

Resumen de conteo


Exploración: contenido de ./data/processed/pubmed_filtered_search_results.json


Embeddings

Tested with:

🔥 PyTorch: 2.8.0+cu128 🖥️ CUDA 🎮 GPU: NVIDIA GeForce RTX 3050 Laptop GPU 💾 GPU: 4.0 GB

EmbeddingGemma

Parameters:

🔧 cuda: True 📥 EmbeddingGemma model: ‘google/embeddinggemma-300M’ 📊 Model parameters: ‘307,581,696’
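Downstream, the resulting embeddings are compared with cosine similarity; a minimal numpy illustration with toy vectors standing in for the model's output:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dim vectors; real EmbeddingGemma vectors are much higher-dimensional
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([1.0, 0.0, 0.5])
print(round(cosine_similarity(v1, v2), 3))  # → 0.949
```

Values close to 1.0 indicate semantically similar abstracts, which is what the clustering below exploits.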

Code
from pathlib import Path
from IPython.display import HTML, IFrame, display

viz_path = Path('./notebooks/notebooks/data/processed/embeddinggemma_visualization_a.html')

if viz_path.exists():
    try:
        # Try embedding raw HTML so interactive JS/CSS stays inline
        display(HTML(viz_path.read_text(encoding='utf-8')))
    except Exception:
        # Fallback to an iframe if direct embedding fails
        display(IFrame(src=str(viz_path), width='100%', height=700))
else:
    print(f"Visualization not found: {viz_path}")

Clusters

We used TF-IDF (Term Frequency-Inverse Document Frequency), a technique from natural language processing (NLP), to identify keywords or representative terms.

Note
  1. Term frequency (TF): measures how many times a term appears in a document, relative to the total length of the document. This helps identify terms that are frequent within a single document.

\[ TF(t) = \frac{\text{number of times term } t \text{ appears}}{\text{total number of terms in the document}} \]

  2. Inverse document frequency (IDF): measures how unique a term is across the corpus. If a term appears in many documents, its IDF is low, since it is not distinctive. \[ IDF(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right) \]

  3. TF-IDF: the product of TF and IDF. Terms with a high TF-IDF are frequent in one document but rare in the rest of the corpus, which makes them representative.

\[ \text{TF-IDF}(t) = TF(t) \times IDF(t) \]
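The formulas above map directly to a few lines of Python; a minimal sketch over a toy corpus (scikit-learn's `TfidfVectorizer` computes the same quantities, with extra smoothing):

```python
import math

# Toy corpus of tokenized documents
docs = [
    'cell cancer cell expression'.split(),
    'brain neurons neural'.split(),
    'cell brain study'.split(),
]

def tf(term, doc):
    """Term frequency within one document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency across the corpus."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# 'cancer' is frequent in doc 0 and absent elsewhere → high score
print(round(tfidf('cancer', docs[0], docs), 3))  # → 0.275
# 'cell' appears in two of the three docs → lower IDF, lower score
print(round(tfidf('cell', docs[0], docs), 3))    # → 0.203
```

Terms with the highest TF-IDF per cluster are the ones reported as `top_terms` in the summary table below.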

Code
from pathlib import Path
from IPython.display import HTML, IFrame, display
import re

viz_path = Path('notebooks/notebooks/data/processed/embeddinggemma_visualization_cluster.html')

if not viz_path.exists():
    print(f"Visualization not found: {viz_path}")
else:
    html_text = viz_path.read_text(encoding='utf-8')

    # Try to embed the full HTML inline (best for preserving JS/CSS)
    try:
        display(HTML(html_text))
    except Exception:
        # If inline embedding fails (often due to heavy JS), show a static table (if present)
        table_match = re.search(r'(<table[\s\S]*?>[\s\S]*?</table>)', html_text, flags=re.IGNORECASE)
        if table_match:
            display(HTML(table_match.group(1)))
        # Always provide an iframe fallback for the interactive plot/visualization
        display(IFrame(src=str(viz_path), width='100%', height=800))
Code
from pathlib import Path
from IPython.display import HTML, IFrame, display
import re

html_path = Path('notebooks/notebooks/data/processed/embeddinggemma_cluster_summary.html')

if not html_path.exists():
    print(f"Visualization not found: {html_path}")
else:
    html_text = html_path.read_text(encoding='utf-8')
    table_html = None

    # Prefer BeautifulSoup if available for robust extraction
    try:
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html_text, 'html.parser')
        tables = soup.find_all('table')
        if tables:
            table_html = ''.join(str(t) for t in tables)
    except Exception:
        # Fallback to regex if BeautifulSoup is not installed
        m = re.search(r'(<table[\s\S]*?>[\s\S]*?</table>)', html_text, flags=re.IGNORECASE)
        if m:
            table_html = m.group(1)

    if table_html:
        # Wrap table in a responsive container to allow horizontal scrolling
        display(HTML(f'<div style="overflow-x:auto">{table_html}</div>'))
    else:
        # If no table was found, embed the full HTML as an iframe
        display(HTML("<p>No &lt;table&gt; found in the visualization file. Showing full HTML below.</p>"))
        display(IFrame(src=str(html_path), width='100%', height=800))
| cluster | size | top_terms | top_journal_example | year_range |
|--------:|-----:|-----------|---------------------|------------|
| 0 | 135 | species, bacterial, host, bacteria, microbial, study | PloS one | 1992 - 2025 |
| 1 | 135 | type, caused, human, editorial, effects, genus | Nature | 1995 - 2025 |
| 2 | 231 | cell, cancer, cells, expression, protein, disease | International journal of molecular sciences | 2001 - 2025 |
| 3 | 142 | patients, study, clinical, outcomes, risk, care | Open forum infectious diseases | 2000 - 2025 |
| 4 | 114 | brain, neurons, cells, neural, neuronal, synaptic | bioRxiv : the preprint server for biology | 2017 - 2025 |
| 5 | 94 | high, protein, applications, time, energy, method | Physical review letters | 2019 - 2025 |